Abstract

Since its introduction in 2004, the MapReduce framework has become one of
the standard approaches in massive distributed and parallel computation. In
contrast to its intensive use in practice, theoretical footing is still
limited, and only little work has been done yet to put MapReduce on a par
with the major computational models. Following pioneer work that relates the
MapReduce framework with PRAM and BSP in their macroscopic structure, we
focus on the functionality provided by the framework itself, considered in
the parallel external memory model (PEM). In this, we present upper and
lower bounds on the parallel I/O-complexity that are matching up to constant
factors for the shuffle step. The shuffle step is the single communication
phase where all information of one MapReduce invocation gets transferred
from map workers to reduce workers. Hence, we move the focus towards the
internal communication step in contrast to previous work. The results we
obtain further carry over to the BSP* model. On the one hand, this shows how
much complexity can be "hidden" for an algorithm expressed in MapReduce
compared to PEM. On the other hand, our results bound the worst-case
performance loss of the MapReduce approach in terms of I/O-efficiency.



1 Introduction 

The MapReduce framework has been introduced by Dean and Ghemawat [5]
to provide a simple parallel model for the design of algorithms on huge data
sets. It allows an easy design of parallel programs that scale to large clusters
of hundreds or thousands of PCs. Since its introduction in 2004, apart from
its intensive use by Google for tasks involving petabytes of data each day [5],
the open source implementation Hadoop [16] has found many applications
including regular use by companies like Yahoo!, eBay, Facebook, Twitter 
and IBM. This success can be traced back to both the short development 
time of programs even for programmers without experience in parallel and 
distributed programs, and the fast and fault tolerant execution of many tasks. 

However, there is also criticism passed on current progression towards 
MapReduce [121 IS] • This includes criticism on the applicability of MapRe- 
duce in all its simplicity to tasks where more evolved techniques have been 
examined already. Hence, it is of high importance to understand when the
MapReduce model can and cannot lead to implementations that are efficient in
practice. In this spirit, MapReduce has been compared to PRAM [11] and BSP
[9] by presenting simulations in MapReduce. But theoretical foundations are
still evolving.

Especially in high performance computing, the number of gigaflops pro- 
vided by today's hardware is more than sufficient, and delivering data, i.e., 
memory access, usually forms the bottleneck of computation. In clusters con- 
sisting of a large number of machines, the communication introduces another 
bottleneck of a similar kind. Therefore, we believe that the (parallel)
external memory model [1, 2] provides an important context in which MapReduce
should be examined.

In this paper, we provide further insights, contributing to the discussion 
on applicability of MapReduce, in that we shed light on the I/O-efficiency 
loss when expressing an algorithm in MapReduce. On the other hand, our 
investigation bounds the complexity that can be "hidden" in the framework 
of MapReduce in comparison to the parallel external memory (PEM) model. 
This serves for a direct lower bound on the number of rounds given a lower 
bound on the I/O-complexity in the PEM model. The main technical contri- 
bution of this work is the consideration of the shuffle step which is the single 
communication phase between processors / workers during a MapReduce 
round. In this step, all information is redistributed among the workers. 

MapReduce Framework The MapReduce framework can be understood 
as an interleaved model of parallel and serial computation. It operates in 
rounds where within one round the user-defined serial functions are executed 
independently in parallel. Each round consists of the consecutive execution of 
a map, shuffle and reduce step. The input is a set of (key, value) pairs. Since
each mapper and reducer is responsible for a certain (known) key, w.l.o.g. we 
can rename keys to be contiguous and starting with one. 
A round of the MapReduce framework begins with the parallel execution
of independent map operations. Each map operation is supplied with one
(key, value) pair as input and generates a number of intermediate (key, value)
pairs. To allow for parallel execution, it is important that map operations
are independent of each other and rely on a single input pair. In the
shuffle step, the set of all intermediate pairs is redistributed such that
lists of pairs
with the same key are available for the reduce step. The reduce operation 
for key k gets the list of intermediate pairs with key k and generates a new 
(usually smaller) set of pairs. 
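
As a minimal sketch of the round structure just described, the following sequential Python simulation runs one map, shuffle and reduce step. The names run_round, word_map and count_reduce are hypothetical and only serve as an illustration; a real framework executes the map and reduce calls in parallel on many workers.

```python
from collections import defaultdict

def run_round(pairs, map_fn, reduce_fn):
    """Simulate one MapReduce round sequentially: map, shuffle, reduce."""
    # Map step: every input pair is processed independently of the others.
    intermediate = []
    for key, value in pairs:
        intermediate.extend(map_fn(key, value))
    # Shuffle step: collect the values of all intermediate pairs per key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # Reduce step: one reduce call per intermediate key.
    output = []
    for key in sorted(groups):
        output.extend(reduce_fn(key, groups[key]))
    return output

def word_map(_doc_id, text):
    # Hypothetical map function: emit (word, 1) for each word.
    return [(word, 1) for word in text.split()]

def count_reduce(word, ones):
    # Hypothetical reduce function: sum the counts of one word.
    return [(word, sum(ones))]

result = run_round([(1, "a b a"), (2, "b c")], word_map, count_reduce)
```

Here the shuffle step is the dictionary grouping in the middle; it is the only place where data emitted by different map calls meets.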

The original description and current implementations realise this framework
by first performing a split function to distribute input data to workers.
Usually, multiple map and reduce tasks are assigned to a single worker. Dur- 
ing the map phase, intermediate pairs are already partitioned according to 
their keys into sets that will be reduced by the same worker. The intermedi- 
ate pairs still reside at the worker that performed the map operation and are 
then pulled by the reduce worker. Sorting the intermediate pairs of one re- 
duce worker by key finalises the shuffle phase. Finally, the reduce operations 
are executed to complete the round. A common extension of the framework 
is the introduction of a combiner function that is similar in spirit to the 
reduce function. However, a combine function is already applied during the 
map execution, as soon as enough intermediate pairs with the same key have 
been generated. 

Typically, a MapReduce program involves several rounds where the out- 
put of one round's reduce functions serves as the input of the next round's 
map functions [6]. Although most examples are simple enough to be solved in
one round, there are many tasks that involve several rounds such as comput- 
ing page rank or prefix sums. In this case a consideration of the shuffle step 
becomes most important, especially when map and reduce are I/O-bounded 
by writing and reading intermediate keys. If map and reduce functions are 
hard to evaluate and large data sets are reduced in their size by the map 
function, it is important to find evolved techniques for the evaluation of 
these functions. However, this shall not be the focus of our work. 

One can see the shuffle step as the transposition of a (sparse) matrix: 
Considering columns as origin and rows as destination, there is a non-zero 
entry $x_{ij}$ iff there is a pair $(i, x_{ij})$ emitted by the $j$th map operation (and
hence will be sent to reducer i). Data is first given partitioned by column, and 
the task of the shuffle step is to reorder non-zero entries row-wise. Note that 
there is consensus in current implementations to use a partition operation 
during the map operation as described above. This can be considered as a 
first part of the shuffle step. 
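
The matrix view can be made concrete with a small sequential sketch. The function name shuffle_as_transpose and the list-of-lists input format are assumptions chosen for illustration; the sketch ignores blocks and parallelism and only shows the reordering from a column-wise to a row-wise layout.

```python
def shuffle_as_transpose(columns):
    """columns[j] holds the pairs (i, x_ij) emitted by map operation j.

    Returns a dict mapping each row i (reducer) to its entries (j, x_ij),
    i.e. the non-zero entries reordered row-wise.
    """
    # Tag every pair with its column index to obtain triples (i, j, x_ij).
    triples = [(i, j, x) for j, col in enumerate(columns) for (i, x) in col]
    # Row-major order: sort by row index first, then by column index.
    triples.sort(key=lambda t: (t[0], t[1]))
    rows = {}
    for i, j, x in triples:
        rows.setdefault(i, []).append((j, x))
    return rows

# Three map operations (columns) emitting pairs for reducers 0, 1 and 2.
cols = [[(2, "a"), (0, "b")], [(0, "c")], [(1, "d"), (2, "e")]]
rows = shuffle_as_transpose(cols)
```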
Related work Feldman et al. [8] started a first theoretical comparison of
MapReduce and streaming computation. They address the class of symmet- 
ric functions (that are invariant under permutation of the input) and restrict 
communication and space for each worker to be polylogarithmic in the input 
size N (but mention that results extend to other sublinear functions). In [11],
Karloff et al. state a theoretical formulation of the MapReduce model where
space and the number of workers are limited by $N^{1-\epsilon}$. Similarly,
space restrictions limit the number of elements each worker can send or re- 
ceive. In contrast to other theoretical models, they allow several map and 
reduce tasks to be run on a single worker. For this model, they present an 
efficient simulation for a subclass of EREW PRAM algorithms. Goodrich et 
al. [9] introduce the parameter M to restrict the number of elements sent or
received by a machine. Similar to the external memory model where computa- 
tion is performed in a fast cache of size M, this introduces another parameter 
additionally to the input size N. Their MapReduce model compares to the 
BSP model with M-relation, i.e., a restricted message passing degree of M 
per super-step. The main difference is that in all the MapReduce models
information cannot reside in memory of a worker, but has to be resent to 
itself to be preserved for the next round. A simulation of BSP and CRCW 
PRAM algorithms is presented based on this model. 

The restriction of worker-to-worker communication allows for the number
of rounds to be a meaningful performance measure. As observed in 
and [9], without restrictions on space / communication there is always a
trivial non-parallel one-round algorithm where a single reducer performs a 
sequential algorithm. 

On the experimental side, MapReduce has been applied to multi-processor
/ multi-core machines with shared memory [13]. The authors found several classes
of problems that perform well in MapReduce even on a single machine.

As described above, the shuffle step can be considered as a matrix transpo- 
sition. Following the model of restricting communication, the corresponding 
matrix is restricted to have a certain maximum number of non-zero entries 
per column and row. Previous work considered the multiplication of a sparse
matrix with one vector [3] and several vectors simultaneously [10] in the
external memory model. The I/O-complexity of transposing a dense matrix
was settled in the seminal paper by Aggarwal and Vitter [1] that introduced
the external memory model. A parallel version of the external memory model 
was proposed by Arge et al. [2]. 

Our contribution We provide upper and lower bounds on the parallel 
I/O-complexity of the shuffle step. In this, we can show that current
implementations of the MapReduce model as a framework are almost optimal
in the sense of worst-case asymptotic parallel I/O-complexity. This further 
yields a simple method to consider the external memory performance of an 
algorithm expressed in MapReduce. Since we consider the PEM model, we 
assume a shared memory. However, our results can be applied to any layer of 
the memory hierarchy, and also extend to the communication layer, i.e. the 
network. This is expressed by a comparison to the BSP* model. 

Following the abstract description of MapReduce [9, 11], the input of each
map function is a single (key, value) pair. The output of reduce instead can 
be any finite set of pairs. In terms of I/O complexity, however, it is not 
important how many pairs are emitted, but rather the size of the input / 
output matters. 

We analyse several different types of map and reduce functions. For map, 
we first consider an arbitrary order of the emitted intermediate pairs. This is 
most commonly phrased as the standard shuffle step provided by a framework. 
Another case is that intermediate pairs are emitted ordered by their key. 
Moreover, as a last case, we allow evaluations of a map function in parallel 
by multiple processors. For reduce, we consider the standard implementation 
which guarantees that a single processor gets data for the reduce operations 
ordered by intermediate key. Additionally, we consider another type of reduce 
which is assumed to be associative and parallelisable. This is comparable to 
the combiner function described before (cf. [3]). In this case, the final result 
of reduce will be created by a binary tree-like reduction of the partial reduce 
results that were generated by processors holding intermediate results from 
the same key. For the cases where we actually consider the evaluation of 
map and reduce functions, we assume that input / output of a single map / 
reduce function fits into internal memory. We further assume in these cases 
that input and output read by a single processor does not exceed the number 
of intermediate pairs it accesses. Otherwise, the complexity of the task can be 
dominated by reading the input, writing the output respectively, which leads 
to a different character that is strongly influenced by the implementation of 
map and reduce. For the most general case of MapReduce, we simply assume 
that intermediate keys have already been generated by the map function, and 
have to be reordered to be provided as a list to the reduce workers. 

For our lower bounds to be matching, we have to assume that the number 
of messages sent and received by a processor is restricted. More precisely, for 
$N_M$ being the number of map operations and $N_R$ the number of reducers, we
require that each reducer receives at most $N_M^{1-\gamma}$ intermediate pairs, and each
mapper emits at most $N_R^{1-\gamma}$, where $\gamma$ depends on the type of map operation.
However, for the first and the second types of map as described above, any
$\gamma > 0$ is sufficient.
Outline In the next section, we give a description of the PEM. This will
be followed by a comparison of the PEM and the BSP* model. In Section 4,
we present algorithms for the shuffle step for all considered map and reduce
types. Our lower bounds that match up to constant factors are given in
Section 5.

2 The parallel external memory model 

The classical external memory model introduced in [1] assumes a two-layer
memory hierarchy. It consists of an internal memory (cache) that can hold up 
to M elements, and an external memory (disk) of infinite size, organised in 
blocks of B elements. Computations and rearrangement of elements can only 
be done with elements residing in internal memory. With one I/O, a block 
can be moved between internal and external memory. This models quite well 
the design of current hardware with a hierarchy of faster and smaller caches 
towards the CPU. Disk accesses are usually cost-intensive and hence many 
contiguous elements are transferred at the same time. 

As a parallel version of this model, the PEM was proposed by Arge et 
al. [2] replacing the single CPU-cache by P parallel caches and CPUs that 
operate on them (cf. Figure 1). External memory is treated as shared
memory, and within one parallel I/O, each processor can perform an input or an
output of its internal memory to disk. Similar to the PRAM model, one 
has to define how overlapping access is handled. In this paper, we assume 
concurrent read, exclusive write (CREW). However, the results can be easily 
modified for CRCW or EREW. 
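
To illustrate how costs are counted in this model, the following sketch computes the number of I/Os for scanning n contiguous elements, first for a single processor and then for P processors that split the blocks evenly. The function names are hypothetical, and the even split is an assumption of the sketch.

```python
from math import ceil

def scan_ios(n, B):
    """Block transfers one processor needs to read n contiguous elements."""
    return ceil(n / B)

def parallel_scan_ios(n, B, P):
    """Parallel I/Os when the blocks are divided evenly among P processors."""
    return ceil(scan_ios(n, B) / P)
```

For example, 10 elements in blocks of size 4 occupy 3 blocks, so two processors need 2 parallel I/Os to read them.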





Figure 1: The parallel external memory model (PEM). 

3 A comparison to BSP* 

For a BSP model comparable with PEM, we assume that computational 
costs are dominated by communication. One of the parameters in the BSP 
model is the h-relation, the number of messages each processor is allowed to
send and to receive. Additionally, a latency / startup cost L per super-step is
assumed. The total cost of an algorithm with T super-steps is $T \cdot (h + L)$. All
this conforms with the definition of BSP in [15], when normalising to network
throughput g = 1. However, as a significant change compared to [15], one can
define the latency L in relation to the number of connections each processor
has to establish. This is justified for handshake protocols but also for the
encapsulation process of messages performed by the network layers in the OSI
model, i.e., today's network protocols. Hence, an incentive is given to send
a number of elements of magnitude comparable to the connection latency to
the same processor. Another way to express this is the BSP* model [3], which
encourages block-wise communication by defining a cost model of
$g h \lceil s/B \rceil + L$ per super-step for maximum message length s.
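
The BSP* cost model can be written down directly. The sketch below is only an illustration: the function names are hypothetical, and the total cost assumes that every super-step uses the same parameters.

```python
from math import ceil

def superstep_cost(g, h, s, B, L):
    """Cost of one BSP* super-step: g * h * ceil(s / B) + L."""
    return g * h * ceil(s / B) + L

def total_cost(T, g, h, s, B, L):
    """Cost of T super-steps, assuming identical parameters in each step."""
    return T * superstep_cost(g, h, s, B, L)
```

A message of length s = 10 with B = 4 is charged like 3 blocks, which is the incentive for block-wise communication mentioned above.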

The version that is best comparable with PEM is the BSP* with 1-relation 
(i.e., h = 1). Assuming g > L, we can restrict ourselves to s < B. Otherwise 
a message has to be divided into multiple super-steps which only changes 
costs by a constant factor. Any such 1-BSP* algorithm with $\ell$ super-steps
on input size N can be simulated in the EREW PEM with $2\ell$ parallel I/Os,
if input and output are equally distributed over internal memories and M 
is sufficiently large. In this, one parallel-output of the blocks that are to be 
sent is done per super-step. In a following parallel input, these blocks are 
input by their destined processor. Hence, all our lower bounds hold for the 
1-BSP* where M can be set to any arbitrarily large value (e.g. N) such that 
N/P < M. Note that the simulation can be non-uniform, which, however,
does not change the statement.

Similarly, a 1-BSP* algorithm can be derived from an EREW PEM algo- 
rithm for M = N/P. For each output that is made by the PEM algorithm, 
the corresponding block is sent to the processor that will read the block near- 
est in the future. In general, this implies that multiple blocks can be sent 
to the same processor violating the 1-relation. However, our algorithms are 
easily transformed to avoid this problem. Furthermore, assignment of input 
and output is not necessary in the BSP model, and the applied parallel sort- 
ing algorithm is even derived from a BSP algorithm (see [2]). We omit a 
detailed description here since it is not the focus of our work. 

4 Upper bounds for the shuffle step 

We start with the description of some basic building blocks that are performed 
multiple times within our algorithms. To form a block from elements that are 
spread over several internal memories, processors can communicate elements 
in a tree-like fashion to form the complete block in $\log \min\{P, B\}$ I/Os. This
is referred to as gathering. If a block is created by computations involving
elements from several processors (e.g., summing multiple blocks), still $\log P$
I/Os are sufficient. Similarly, a block can be spread to multiple processors in
$\log P$ I/Os (scattering). If n blocks for each processor have to be scattered
/ gathered independently, this can be serialised such that $n + \log P$ I/Os are
sufficient. Note that for all gather and scatter tasks, the communication 
structure has to be known to each participant. This is for example the case 
when participating processors constitute an ordered set which is known to 
all. Additionally, we require the computation of prefix sums. This task has 
been extensively studied in parallel models. For the PEM model see [2] for 
a description. 
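
The tree-like gathering can be simulated as follows. This is a sequential sketch under the assumption that in each round, neighbouring holders combine their pieces pairwise; the function name gather_rounds is hypothetical, and each round stands for a constant number of parallel I/Os (one output and one input).

```python
def gather_rounds(pieces):
    """Combine per-processor pieces pairwise until one holder has the block.

    pieces is a list of lists, one per participating processor, in the
    known processor order. Returns the gathered block and the round count.
    """
    rounds = 0
    while len(pieces) > 1:
        merged = []
        for a in range(0, len(pieces), 2):
            # Two neighbouring holders combine their pieces in this round.
            merged.append(sum(pieces[a:a + 2], []))
        pieces = merged
        rounds += 1
    return pieces[0], rounds

block, rounds = gather_rounds([[1], [2], [3], [4], [5]])
```

With five participants the sketch needs three rounds, i.e. the logarithm of the number of participants rounded up. Since a block holds B elements, at most min {P, B} processors can contribute to it.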

Range-bounded load-balancing For load balancing reasons, the following
task will appear several times. We are given n tuples (i, x) that are located in
contiguous external memory, ordered by some non-unique key $i \in \{1, \dots, m\}$
for m < n. The task is to assign the n tuples to P processors such that, up to
a constant factor, each processor gets n/P tuples assigned to it, but keys are
within a range of size m/P.

The task can be solved by dividing the P processors into $\lceil P/2 \rceil$ volume
processors and $\lfloor P/2 \rfloor$ range processors. First, data is assigned to the volume
processors by volume. In this, each volume processor gets a consecutive piece
of $\lceil 2n/P \rceil$ tuples assigned to it. For communication purposes, we assume that
for each processor there is an exclusive block reserved for messages in external
memory, denoted as inbox.

For the next step, think of the ordered set of keys (1, ..., m) being
partitioned into P/2 ranges of at most $\lceil 2m/P \rceil$ contiguous keys each (further
referred to as key ranges). To cope with the problem of having too many
different keys assigned to the same volume processor, we reserve the ith range
processor to become responsible for tuples within the (i + 1)th key range.
Now, each volume processor scans its assigned area and keeps track of the
positions in external memory where a new key range begins. This requires
only $O(n/(PB))$ I/Os for scanning the assigned tuples while one block in memory is
reserved to buffer and output the starting positions of a new range.

Afterwards, each volume processor with more than $\lceil 2m/P \rceil$ keys assigned
to it has created a list of memory positions where a new key range begins.
This list is then scattered block-wise to the range processors reserved for the
corresponding key ranges. Although we assume CREW, it is necessary to
distribute this information: a program running on a range processor does not
know the memory position of the information meant for it, because this
position depends on which volume processor got assigned the key range. The
distribution is possible in $\log \min\{B, P\}$ parallel I/Os, where each block of a
list is written into the inbox of the first range processor concerned by this
block, and from there spread in a binary manner to the other range processors
(cf. Figure 2). In this, each processor can be informed about both the
beginning and the end of its assigned area. Note that it can never happen that
a range processor receives information from multiple volume processors,
because for a key range divided among multiple volume processors only the
first can dispose the tuples.
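
The first two steps of this method, the assignment by volume and the detection of key range boundaries, can be sketched sequentially as follows. All function names are hypothetical. The sketch only computes the pieces and the boundary positions; the hand-over of tuples to range processors and the scattering of the position lists are left out.

```python
from math import ceil

def volume_pieces(n, P):
    """Consecutive pieces of ceil(2n/P) tuples, one per volume processor."""
    size = ceil(2 * n / P)
    return [(start, min(start + size, n)) for start in range(0, n, size)]

def key_range(key, m, P):
    """0-based index of the key range (width ceil(2m/P)) of a key in 1..m."""
    return (key - 1) // ceil(2 * m / P)

def range_starts(keys, piece, m, P):
    """Positions within a piece where a new key range begins."""
    start, end = piece
    found = []
    for pos in range(start, end):
        current = key_range(keys[pos], m, P)
        # A range begins at the very first tuple or where the range changes.
        if pos == 0 or key_range(keys[pos - 1], m, P) != current:
            found.append((current, pos))
    return found

keys = [1, 1, 2, 3, 3, 3, 4, 4]  # n = 8 tuples sorted by key, m = 4
pieces = volume_pieces(len(keys), P=4)
starts = [range_starts(keys, piece, m=4, P=4) for piece in pieces]
```

In this example two volume processors receive four tuples each, and only the first one observes the beginning of a key range.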




Figure 2: Scattering the list of key ranges. 



Contraction Given an input sequence of n elements written in m blocks
on disk, where some of the memory cells are empty, it is possible to contract
the sequence in $m/P + \log P$ I/Os such that empty cells are removed and
the sequence is written in the same ordering as before, but with elements
stored contiguously. To this end, the input blocks are assigned equally to
the P processors. Each processor gets a contiguous piece of up to $\lceil m/P \rceil$
blocks assigned to it, and processors are assigned in ascending order. Within
one scan, each processor can determine the number of non-empty cells
contained in its assigned area. With this information and the given ordering
of processors, a prefix sum computation in $\log P$ I/Os tells each processor
where its elements shall be placed in a contiguous output.
For each block of the contiguous output, the processor with the first and
the processor with the last element assigned to it know this from the prefix
sum results. Since processors are assigned in ascending
order to the pieces, a gather operation can be used to create blocks with
elements assigned to several processors. Note that each processor only needs
to participate in at most two gather operations. The gather process can be
achieved by an output of each first and last processor of a block, in that
both write their indices into a designated table. Then, each processor can
read this information and send its elements to the inbox of the responsible
processor in the gather process.
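
The role of the prefix sums in the contraction can be seen in the following sequential sketch. It is a simplification under stated assumptions: the function name contract is hypothetical, empty cells are represented by None, the input is non-empty, and pieces are formed from single cells rather than from blocks, so the gather operations for shared output blocks do not appear.

```python
from itertools import accumulate

def contract(cells, P):
    """Remove empty cells (None) while keeping the order of the elements.

    Returns the contracted sequence and the output offset of each piece.
    """
    n = len(cells)
    size = -(-n // P)  # ceil(n / P) cells per processor
    pieces = [cells[s:s + size] for s in range(0, n, size)]
    # One scan per processor: count the non-empty cells of its piece.
    counts = [sum(c is not None for c in piece) for piece in pieces]
    # Exclusive prefix sums give the output position of each piece.
    offsets = [0] + list(accumulate(counts))[:-1]
    out = [None] * sum(counts)
    for piece, off in zip(pieces, offsets):
        kept = [c for c in piece if c is not None]
        out[off:off + len(kept)] = kept
    return out, offsets

out, offsets = contract([5, None, 7, None, None, 9, 1, None], P=3)
```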

Algorithms 

For a clearer understanding of the shuffle step, we use the analogy of a 
sparse matrix. Let $N_M$ be the number of distinct input keys, and $N_R$ be the
number of distinct intermediate keys (i.e., independent reduce runs). Each
pair $(i, x_{ij})$ emitted by map operation j can be considered a triple $(i, j, x_{ij})$.
Using this notation, one can think of a sparse $N_R \times N_M$ matrix with
non-zero entries $x_{ij}$. This matrix is given in some layout determined by the map
function and has to be either reordered into a row-wise ordering, or a layout
where rows can be reduced easily. In the following, we consider input keys
as column indices and intermediate keys as row indices. The total number
of intermediate pairs / non-zero elements is denoted H. Additionally, we
have w, the number of elements emitted by a reduce function, and v, the
size of the input to a map function, with $v, w < \min\{M - B, \lceil H/P \rceil\}$ as argued
in the introduction. An overview of the algorithmic complexities is given in
Table 1. For space reasons, the term $\log P$ is omitted in Table 1. We use
$\log_b x := \max\{\log_b x, 1\}$. The complexities given in Table 1 only differ from
the descriptions in that we distinguish the special case R = 1, and make
use of the observation that asymptotically $\log_d (x/d) = \log_d x$. For all our
algorithms we assume H/P > B, i.e., there are fewer processors than blocks in
the input such that each processor can get a complete block assigned to it.





                   Non-parallel reduce      Parallel reduce

Unordered map      [illegible in source]    [illegible in source]
Sorted map         [illegible in source]    [illegible in source]
Parallel map       [illegible in source]    $\frac{H}{PB} \log_d \frac{N_M N_R v w}{BH}$

Direct shuffling   H/P (non-uniform)
Complete merge     $\frac{H}{PB} \log_d \frac{H}{B}$

Table 1: Overview of the algorithmic complexities with
$d = \min\{M/B, H/(PB)\}$.

Direct shuffling 

Obviously, the shuffle step can be completed by accessing each element once
and writing it to its destination. For a non-uniform algorithm, that is, with
knowledge of the non-zero positions, H/P parallel I/Os are sufficient. To
this end, the output can be partitioned into consecutive parts of H/P
elements, one per processor. Since we assume H/P > B, collisions when writing
can be avoided. In contrast, reading the elements in order to write them
to their destination is possible concurrently because we consider CREW. We
restrict ourselves in this case to a non-uniform algorithm to match the lower
bounds in Section 5. This shows that our lower bounds are asymptotically
tight. Such a direct shuffle approach can be optimal. However, for other cases
that are closer to real-world parameter settings, a more elaborate approach is
presented in the following.



Map-dependent shuffle part

In this part, we describe for different types of map functions how to prepare
intermediate pairs to be reduced in a next step. To this end, during the
first step R meta-runs of non-zero entries from ranges of different columns
will be formed. Afterwards, these meta-runs are further processed to obtain
the final result. The meta-runs shall be internally ordered row-wise (aka row
major layout). If intermediate pairs have to be written in sorted order before
the reduce operation can be applied, we set $R = H/(N_R B)$. Otherwise, if the
reduce function is associative, it will suffice to set $R = H/(N_R \max\{w, B\})$.



Non-parallel map, unordered intermediate pairs We first consider 
the most general (standard) case of MapReduce where we only assume that 
intermediate pairs from different map executions are written in external
memory one after another. The elements are ordered by column but within a
column no ordering is given. We refer to this as mixed column layout. This 
given ordering can only be exploited algorithmically for non-parallel reduce 
functions where a row major layout has to be constructed. 

We apply the parallel merge sort by Arge et al. [2] to sort elements by
row index. This algorithm divides data evenly upon the P processors and
starts with an optimal (single processor) external memory algorithm such
as the merge sort in [1] to create P presorted runs of approximately even
size. Then, these runs are merged in a parallel way with a merging degree
$\lceil d \rceil$ for $d = \max\{2, \min\{H/(PB), \sqrt{H/P}, M/B\}\}$. In contrast to [2], we
slightly changed this merging degree in that we added the term $H/(PB)$
to the minimum. This widens the parameter range given in [2] without
changing the complexity within the original range and guarantees matching
lower bounds. Note that $\log (H/(PB)) < 2 \log \sqrt{H/P}$ such that for asymptotic
considerations $d = \max\{2, \min\{H/(PB), M/B\}\}$ is sufficient. Instead of a full
merge sort, we stop the merging process when the number of runs is less
than R. Note that if P < R, we actually skip the parallel merging and
perform only a local merge sort on each processor. In either case, we get a
parallel I/O-complexity of $\frac{H}{PB} \log_d \frac{H}{BR} + \log P$.
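
The merging degree and the early stop of the merge process can be sketched as follows. The function names are hypothetical. The sketch only counts merge rounds and does not model I/Os; as an assumption, it stops as soon as at most R runs remain.

```python
from math import ceil, sqrt

def merging_degree(H, P, B, M):
    """ceil(d) for d = max{2, min{H/(PB), sqrt(H/P), M/B}}."""
    d = max(2, min(H / (P * B), sqrt(H / P), M / B))
    return ceil(d)

def merge_rounds(runs, R, degree):
    """Rounds of degree-way merging until at most R runs remain."""
    rounds = 0
    while runs > R:
        # Each round merges groups of `degree` runs into one run.
        runs = ceil(runs / degree)
        rounds += 1
    return rounds
```

For H = 2^20, P = 16, B = 64 and M = 2^12, the three candidate values are 1024, 256 and 64, so the merging degree is 64. Stopping at R runs instead of a single run saves the last merge rounds.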
Non-parallel map, sorted intermediate pairs Here, we assume that
within a column, elements are additionally ordered by row index, i.e.,
intermediate pairs are emitted sorted by their key. This corresponds to the
column major layout. In the following, we assume $H/N_R > B$. Otherwise,
the previous algorithm is applied, or simply columns are merged together as
described in a later paragraph.

Since columns are ordered internally, each column can serve as a
presorted run. Thus, we assign elements using the range-bounded load-balancing
method, with column indices serving as keys. With this assignment we run
the parallel merge sort, but with the following modifications. Again, we
stop the merging process as soon as fewer than R runs remain. Instead of
the general external memory sorting algorithm that is used locally on each
processor, we use the M/B-way merge sort to merge all the up to $N_M/P$
columns that are assigned to one processor. For P < R, this local merge sort
finishes the task by reducing the $N_M$ columns into R runs. If P > R, the
local merging creates P runs out of the $N_M$ columns. Then, in the parallel
merge phase, the P runs are reduced to R runs. The total I/O-complexity
of this algorithm is $\frac{H}{PB} \log_d \frac{N_M}{R} + \log P$.
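
That sorted columns can be used directly as presorted runs is illustrated by the following sketch, which merges all columns in a single multi-way merge. The function name merge_columns is hypothetical; the sketch is sequential and merges completely instead of stopping at R runs.

```python
import heapq

def merge_columns(columns):
    """Merge columns that are each sorted by row index.

    columns[j] is a list of pairs (i, x_ij) sorted by i. The result is the
    list of triples (i, j, x_ij) in row-major order.
    """
    # Tag each entry with its column index; each column stays sorted.
    tagged = [[(i, j, x) for (i, x) in col] for j, col in enumerate(columns)]
    # Multi-way merge of the presorted runs.
    return list(heapq.merge(*tagged))

merged = merge_columns([[(0, "b"), (2, "a")], [(0, "c")], [(1, "d"), (2, "e")]])
```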

Parallel map, sorted intermediate pairs In the following, we describe 
the case with the best possible I/O-complexity for the shuffle step when 
no further restrictions on the distribution of intermediate keys are made. 
This is the only case where we actually consider the execution of the map 
function itself. Note that in terms of I/O-complexity, an algorithm can emit 
intermediate pairs from a predefined key range only. This is possible since 
intermediate pairs are generated in internal memory, and can be removed 
immediately without inducing an I/O, while pairs within the range of interest 
are kept. In a model with considerations of the computational cost, it would 
be more appropriate to consider a map function which can be parallelised to 
emit pairs in a predefined intermediate key range. 

We first describe the layout of intermediate pairs in external memory that 
shall be produced. Let $m = \min\{M - B, \lceil H/P \rceil\}$. This time, intermediate
pairs are not simply ordered by column primarily as before, but data is split 
into ranges of m/v columns and within each such meta-column elements 
are ordered row-wise. When creating this layout, each processor can keep 
its memory filled with the m input elements required for each meta-column 
while writing the intermediate results. This layout can be obtained by first 
determining the volume of each meta-column in parallel by requesting in- 
termediate results without writing them to disk. Using parallel prefix sum 
computation, it can be determined which processor will be assigned to which 
volume part of a meta-column in order to realise an equal load balancing.
This step is possible in $\log P$ I/Os. After the assignment of processors to
meta-columns, each processor reads and keeps input pairs for a whole meta- 
column in internal memory. Using the map function, intermediate pairs are 
requested and extracted, and then written to external memory in row-wise 
order (within a meta-column). If a processor is assigned to multiple meta- 
columns, it processes each meta-column one after another. Since H/P > B, 
multiple processors never have to write to the same block at the same time. 
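
The target layout of this step can be sketched as follows. The function name meta_columns and the parameter cols_per_meta (standing for m/v) are hypothetical; the sketch works on already emitted triples and ignores the assignment of processors and the block structure.

```python
def meta_columns(triples, cols_per_meta):
    """Group triples (i, j, x_ij) into meta-columns of cols_per_meta columns.

    Within each meta-column the entries are ordered row-wise.
    """
    metas = {}
    for i, j, x in triples:
        # Column j belongs to meta-column j // cols_per_meta.
        metas.setdefault(j // cols_per_meta, []).append((i, j, x))
    return [sorted(metas[c]) for c in sorted(metas)]

layout = meta_columns(
    [(2, 0, "a"), (0, 0, "b"), (0, 1, "c"), (1, 2, "d"), (2, 2, "e"), (0, 3, "f")],
    cols_per_meta=2,
)
```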

Because we already formed row-wise sorted meta-columns of $m/v$ columns,
the number of merge iterations to generate R row-wise sorted meta-runs is
reduced. If $N_M v/m < R$, nothing needs to be done because the number of
meta-columns is already less than the desired number of meta-runs.
Otherwise, we use the parallel merge sort. Similar to the description for sorted
intermediate pairs, the elements are assigned to processors in a balanced way,
range-bounded now by meta-column index. As a single processor sorting
algorithm, the meta-columns are merged by merge sort (if fewer than P
meta-columns exist). The local or the preceding parallel merging is again stopped
when at most R runs remain. Reducing the NMv/m meta-columns into R 
meta-runs induces an I/O-complexity of log M/B min {M M H/p}R + l°g-P- 

Reduce-dependent shuffle part 

Non-parallel reduce function For the general case of non-parallel reduce
functions, intermediate pairs with the same key are to be provided
consecutively to the reduce worker, i.e., a row major layout has to be created.
This can be obtained in a direct manner as follows. We describe the current
layout in tiles, where one tile consists of the elements in one row within a
meta-column. The macroscopic ordering of these tiles is currently a column
major layout. To obtain the desired layout, tiles only need to be rearranged
into a row major layout.

Observe that there are j^— meta-runs with N R rows each such that there 
are at most ^ non-empty tiles. Each tile can consist of at most two blocks 
that contain elements from another tile (and need to be accessed separately). 
However, these are still parallel I/Os to access these blocks. The remain- 
ing H blocks that belong entirely to a tile contribute another I/Os. 

To rearrange tiles, we assign elements to processors balanced by volume
and range-bounded by the tile index (ordered by meta-runs first, and by row
within a meta-run). In order to write the output in parallel, each processor
has to know the destination positions of its assigned elements. To this end,
each processor scans its assigned elements, and for each beginning of a new
tile, the memory position is output into a table S (consisting of H/B
entries, one block each). Afterwards, using this table, the size of each tile
is determined and written to a new table D. With table D a prefix sum
computation is started in row major layout such that D now contains the
relative output destination of each row within each meta-run. In a CRCW
model, with the same assignment of elements as before, tiles can now be
written to their destination to form the output using table D.
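The role of table D can be made concrete with a sequential sketch: tile sizes are collected, an exclusive prefix sum over the tiles in row major order yields the first output position of every tile, and elements are then written to their destinations. This is an illustration under simplifying assumptions (one processor, elements given as explicit `(meta_run, row, value)` triples); the function name `rearrange_tiles` is hypothetical.

```python
from collections import Counter
from itertools import accumulate


def rearrange_tiles(elements, num_runs, num_rows):
    """elements: list of (meta_run, row, value), sorted by (meta_run, row).

    Returns the values with tiles rearranged into row major order, i.e.
    ordered by row first and by meta-run within a row.
    """
    # Table D: size of every tile, listed in row major order (row, meta_run).
    size = Counter((row, run) for run, row, _ in elements)
    order = [(row, run) for row in range(num_rows) for run in range(num_runs)]
    sizes = [size.get(tile, 0) for tile in order]
    # Exclusive prefix sums: first output position of every tile.
    starts = dict(zip(order, accumulate([0] + sizes[:-1])))
    out = [None] * len(elements)
    offset = Counter()  # number of elements already written per tile
    for run, row, value in elements:
        tile = (row, run)
        out[starts[tile] + offset[tile]] = value
        offset[tile] += 1
    return out


elements = [(0, 0, "a"), (0, 1, "b"), (0, 1, "c"), (1, 0, "d"), (1, 1, "e")]
row_major = rearrange_tiles(elements, num_runs=2, num_rows=2)
```

In the example, row 0 receives its elements from both meta-runs before any element of row 1 is placed.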

For CREW, when first creating D, one can round the tile sizes up to full
blocks. The resulting layout will obviously contain blocks that are not
entirely filled, but contain empty memory cells. However, using the
contraction described above, one can remove these empty cells.

The whole step to finalise the shuffle step has I/O-complexity + log P. 

Parallel (associative) reduce function Assuming a parallelisable re- 
duce, each processor shall perform multiple reduce functions simultaneously 
on a subset of elements with intermediate key in a certain range. In a final 
step, the results of these partial reduce executions are then collected and 
reduced to the final result. 

To this end, the range of intermediate keys is partitioned into
$\lceil N_R P w/H \rceil$ ranges of up to $\lceil H/(Pw) \rceil$ keys. Using the
range-bounded load-balancing algorithm, elements (still ordered in meta-runs)
are assigned to processors such that each processor gets elements from at
most two pieces of row indices. This can be achieved by using the tuple
(meta-run index, row index) as key.
If a processor got assigned elements that belong to the same reduce function, 
elements can be reduced immediately by the processor. Afterwards, for each 
key range, elements can be gathered to form the final result of the reduce 
function. This is possible with jf- + logP I/Os. 
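The two phases of a parallelisable reduce (local partial reduction, then gathering per key) can be sketched as follows. The sketch is sequential and only assumes that the reduce operation is an associative binary function; the name `parallel_reduce` and the representation of processors as a list of chunks are assumptions made for illustration.

```python
import operator


def parallel_reduce(chunks, op):
    """chunks: one list of (key, value) pairs per processor, in order.
    op: associative binary operation.
    Returns a dict mapping every key to its reduced value.
    """
    # Phase 1: every processor reduces the pairs it holds, separately per key.
    partials = []
    for chunk in chunks:
        local = {}
        for key, value in chunk:
            local[key] = op(local[key], value) if key in local else value
        partials.append(local)
    # Phase 2: gather the partial results of each key and reduce them.
    result = {}
    for local in partials:
        for key, value in local.items():
            result[key] = op(result[key], value) if key in result else value
    return result


chunks = [[(0, 1), (0, 2), (1, 3)], [(1, 4), (2, 5)], [(2, 6), (2, 7)]]
sums = parallel_reduce(chunks, operator.add)
```

Because chunks are processed in order, the sketch also works for associative operations that are not commutative, such as string concatenation.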

Complete sorting / merging 

For some choices of parameters, especially for small instances, it can be
optimal to simply apply a sorting algorithm to shuffle elements row-wise.
Using the parallel merge sort, this has I/O-complexity
$O\left(\frac{H}{PB} \log_d \frac{H}{B} + \log P\right)$. Furthermore, if the
matrix is given in column major layout, the $N_M$ already sorted columns can
simply be merged to order elements row-wise. This results in an
I/O-complexity of $O\left(\frac{H}{PB} \log_d N_M + \log P\right)$.

5 Lower bounds for the shuffle step 

A simple task in MapReduce is creating the product of a sparse matrix A
with a vector. Assuming that the matrix entries are implicitly given by the
map function, the task can be accomplished within one round. In this, map
function $j$ is supplied with input vector element $x_j$ and emits
$(i, x_j a_{ij})$. The reduce function simply sums up values of the same
intermediate key. Hence, a lower bound for matrix vector multiplication
immediately implies a lower bound for the shuffle step. Since reduce can be an
arbitrary function, we restrict ourselves to matrix multiplication in a
semiring, where the existence of inverse elements is not guaranteed.
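The one-round formulation of the sparse matrix vector product can be written down as a minimal sequential sketch. The column-wise input format and the name `spmv_mapreduce` are assumptions for illustration; the shuffle is simulated by grouping in a dictionary, and ordinary numbers stand in for a general semiring.

```python
from collections import defaultdict


def spmv_mapreduce(columns, x):
    """columns[j]: list of (i, a_ij) for the non-zero entries of column j.
    x: input vector. Returns the non-zero part of y = A x as a dict.
    """
    # Map: function j receives x[j] and emits (i, x_j * a_ij).
    intermediate = [(i, x[j] * a_ij)
                    for j, col in enumerate(columns)
                    for i, a_ij in col]
    # Shuffle: group intermediate pairs by their key i.
    groups = defaultdict(list)
    for i, value in intermediate:
        groups[i].append(value)
    # Reduce: sum the values of each key; no subtraction is ever needed.
    return {i: sum(values) for i, values in groups.items()}


# A = [[1, 0, 2],
#      [0, 3, 0]]  stored column by column
columns = [[(0, 1)], [(1, 3)], [(0, 2)]]
y = spmv_mapreduce(columns, [1, 2, 3])
```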

A lower bound for sparse N x N matrices in the I/O-model was presented
in [1] and extended to non-square situations in [10]. These bounds are based
on a counting argument, comparing the number of possible programs for the
task with $\ell$ I/Os to the number of distinct matrices. To this end, the
maximal number of different configurations (content) of external and internal
memory after one I/O is examined. In this section, we explain the main
differences to these proofs; the complete proofs of our results can be found
in Appendix A.

Since we have multiple processors and assume a CREW environment, we 
have to dissociate from the perspective of moving blocks. Instead, we assume 
that a block in external memory which does not belong to the final output 
disintegrates magically immediately after it is read for the last time in this 
form. 

For a parallel I/O, the number of preceding configurations is simply raised
to the power of P. However, in contrast to the single processor EM model,
we have to consider the case H/P < M, i.e., not all processors can have their
internal memory filled entirely. Instead, we consider the current number of
elements $X_{i,l}$ of processor $i$ before the $l$th parallel I/O. The number
of distinct configurations of a program after $\ell$ I/Os is then bounded by

$$\prod_{l=1}^{\ell} 3^P \prod_{i=1}^{P} \binom{X_{i,l}+B}{B} 2^B \, 4P\ell .$$

Furthermore, we consider multiple input and multiple output vectors 
which leads to a combined matrix vector product. In this, any intermedi- 
ate pair - in classical matrix vector multiplication an elementary product of 
a vector element with a non-zero entry - can now be the result of a linear 
combination of the v elements in the corresponding dimension of the input 
vectors, and any output element can be a linear combination of intermediate 
pairs with corresponding intermediate key. We use a simplified version and 
consider the variant where each intermediate pair is simply a copy of one 
of the v input elements, and it is required for the computation of precisely 
one output element. Hence, for each of the H non-zero entries in our matrix,
there is not only a choice of its position but also the choice from which of
the v input elements it stems, and which of the w output elements will
be its destination. This results in a total number of
$\binom{N_M N_R}{H} v^H w^H$ different tasks for fixed parameters $N_M$, $N_R$,
$H$, $v$ and $w$.

Applying these modifications yields the following results. 

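Theorem 1. Given parameters $B$, $M \ge 3B$ and $P \le H/B$. Creating the
combined matrix vector product for a sparse $N_R \times N_M$ matrix with $H$
non-zero entries for $H/N_R \le N_M^{1-\varepsilon}$ and
$H/N_M \le N_R^{1-\varepsilon}$ for $\varepsilon > 0$ from $v \le H/N_M$
input vectors to $w \le H/N_R$ output vectors has (parallel) I/O-complexity

• $\Omega\left(\min\left\{\frac{H}{P}, \frac{H}{PB} \log_d \frac{N_R w}{B}\right\}\right)$ if the matrix is in mixed column layout

• $\Omega\left(\min\left\{\frac{H}{P}, \frac{H}{PB} \log_d \min\left\{\frac{N_M N_R w}{H}, \frac{N_R w}{B}\right\}\right\}\right)$ if given in column major layout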

• and n ^min | ^ , log d H ^^^m V h/p} } ) f or ^ e best-case layout with 
H/Nr < ^]Vm and H/N M < </Nr~ 

where $d = \min\{M/B, H/(PB)\}$.

These lower bounds already match the algorithmic complexities for parallel
reduce in Section 4. Moreover, a lower bound for creating a matrix in
row major layout from v vectors can be obtained in a very similar way (cf.
parallel map & non-parallel reduce). This task corresponds to a time-inverse
variant of creating the matrix vector product with a matrix in column major
layout.

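Lemma 1. Given parameters $B$, $M \ge 3B$ and $P \le H/B$. Creating a sparse
$N_R \times N_M$ matrix with $H$ non-zero entries in row major layout from $v$
vectors $x^{(1)}, \dots, x^{(v)}$ such that for all non-zero entries
$a_{ij} = x_j^{(k)}$ holds for some $k$ has (parallel) I/O-complexity

$$\Omega\left(\min\left\{\frac{H}{P}, \frac{H}{PB} \log_d \min\left\{\frac{N_M N_R v}{H}, \frac{N_M v}{B}\right\}\right\}\right)$$

for $H/N_R \le N_M^{1-\varepsilon}$ and $H/N_M \le N_R^{1-\varepsilon}$, $\varepsilon > 0$.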

Theorem 1 and Lemma 1 both hold not only in the worst-case, but for
a fraction of the possible sparse matrices exponentially close to one. Hence,
for distributions over the matrix conformations (position of the non-zero
entries), even if not uniform but somehow skewed, the lower bounds still hold
on average if a constant fraction of the space of matrix conformations has
constant probability. Similarly, the bounds hold on average for distributions
where a constant fraction of the non-zero entries is drawn with constant
probability from a constant fraction of the possible positions.

Transposing bound

Another method to obtain a lower bound for the I/O-complexity is presented 
in pQ. They use a potential function to lower bound the complexity of dense 
matrix transposition. This bound can also be extended to sparse matrix 
transposition in the PEM model for matrices given in column major layout. 
Combining the following bound with Theorem 1 matches the algorithmic
complexities given in Section 4 for non-parallel reduce.

Theorem 2. The transposition of a sparse Nr x Nm matrix with H non-zero 
entries has worst-case parallel I/O-complexity 



A bound for scatter / gather 

To cover all the algorithmic complexities, it remains to justify the scatter and
gather tasks that are required for the exclusive write policy. A lower bound
for sorting related problems can be found in [2]. They show a lower bound
of $\Omega(\log(N/B)) = \Omega(\log P)$ on the number of I/Os.

6 Conclusion 

We determined the parallel worst-case I/O-complexity of the shuffle step for 
most meaningful parameter settings. All our upper and lower bounds for the 
considered variants of map and reduce functions match up to constant factors. 
Although worst-case complexities are considered, most of the lower bounds 
hold with probability exponentially close to one over uniformly drawn shuffle 
tasks. We considered several types of map and reduce operations, depending 
on the ordering in which intermediate pairs are emitted and the ability to 
parallelise the map and reduce operations. All our results hold especially 
for the case where internal memory of the processors is never exceeded but 
(blocked) communication is required. This shows that the parallel external 
memory model reveals a different character than the external memory model 
in that communication can be described in the model even for the case the 
input fits into internal memories. 

Our results show that for parameters that are comparable to real world
settings, sorting in parallel is optimal for the shuffle step. This is met by
current implementations of the MapReduce framework where the shuffle step
consists of several sorting steps, instead of directly sending each element to
its destination. In practice one can observe that a merge sort usually does not
perform well, but rather a distribution sort does. The partition step and the
network communication in current implementations to realise the shuffle step
can be seen as iterations of a distribution sort. Still, our bounds suggest a
slightly better performance when the block size is known. If block and
memory size are unknown to the algorithm, which corresponds to the so-called
cache-oblivious model, it is known that already permuting ($N_M = N_R = H$)
cannot be performed optimally. Sorting instead can be achieved optimally,
but only if $M \ge B^2$ [?]. However, when assuming that the naive algorithm
with I/Os is not optimal, and M > B 2 , all the considered variants have 
the same complexity and reduce to sorting all intermediate pairs in parallel.

A Lower bounds

A lower bound for sparse N x N matrices in the I/O-model was presented 
in jl] and extended to non-square situations in [10]. We follow closely their 
description but have to restate the proof for the PEM. Additionally, we
consider a task where multiple input and output elements are associated with
each non-zero entry. More specifically, we have v input and w output vectors.
The task is to multiply each non-zero entry $a_{ij}$ with only one of the v
vector elements $x_j^{(1)}, \dots, x_j^{(v)}$. The assignment which vector is
used for which non-zero entry is part of the input. Furthermore, each non-zero
entry is associated with one of the w output vectors. The $i$th coordinate of
the $l$th output vector is the sum of all elementary products
$a_{ij} x_j^{(k)}$ that were associated with vector $l$. Hence, each non-zero
entry has, apart from its position in the matrix and its value, two further
variables assigned to it, defining origin of the input vector element and
destination of the elementary product. We refer to this task as the combined
matrix vector product of a given matrix, a set of v input vectors, the number
of output vectors w and a given assignment of non-zero entries to input /
output vectors.

Like in jl], the configuration at time t refers to the concatenation of non- 
empty memory cells of external and internal memory between the tth I/O 
and the t+ 1th I/O. We will need the following technical lemma for the proofs 
of the lower bounds. 

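Lemma 2. Assume $\log 3H > \frac{7}{2} B \log\min\{\frac{M}{B}, \frac{H}{PB}\}$,
$H \ge \max\{N_1, N_2\} \ge 2$, and $H/N_2 \le N_1^{1-\varepsilon}$, then

(i) $H \le N_2^{1/\varepsilon}$

(ii) $N_2 \ge 2^8$ implies $B \le \frac{3}{8\varepsilon} N_2^{3/8}$.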

(Hi) N 2 > 2 8 and H/N 2 < N{ /& implies min {§ , J§} < Np. 

Proof. Combining $H/N_2 \le N_1^{1-\varepsilon}$ with $H \ge N_1$ yields
$N_1 \le N_2 N_1^{1-\varepsilon}$, i.e., $N_1 \le N_2^{1/\varepsilon}$.
Substituting $N_1$ in $H \le N_2 N_1^{1-\varepsilon}$ results in (i).

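For (ii), we have $B \le \frac{2}{7} \log 3H \le \frac{2}{7} \cdot \frac{5}{4} \log H$
since $H \ge N_2 \ge 2^8$ such that we can use $3 \cdot 2^8 \le 2^{10}$, and
note that $\frac{2}{7} \cdot \frac{5}{4} \le \frac{3}{8}$. Using (i), we get
$B \le \frac{3}{8} \log N_2^{1/\varepsilon} \le \frac{3}{8\varepsilon} \log N_2$.
Finally, we simply use the additional observation $\log x \le x^{3/8}$ for
$x \ge 2^8$.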
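The two elementary inequalities used in this proof can be spot-checked numerically. The sketch below assumes that logarithms are taken to base 2 (an assumption suggested by the threshold $2^8$, where $\log x$ and $x^{3/8}$ coincide) and reads the inequalities as non-strict; the function names are hypothetical, and a finite sample is of course no substitute for the proof.

```python
import math


def log_bound_holds(x: float) -> bool:
    """Check log2(x) <= x^(3/8), used for x >= 2^8."""
    # Small tolerance because both sides are equal at x = 2^8.
    return math.log2(x) <= x ** 0.375 + 1e-9


def three_h_bound_holds(h: float) -> bool:
    """Check 3H <= H^(5/4), used for H >= 2^8."""
    return 3 * h <= h ** 1.25


samples = [2 ** 8, 2 ** 10, 2 ** 16, 2 ** 32]
checks = [(log_bound_holds(x), three_h_bound_holds(x)) for x in samples]
```

Below the threshold the first inequality can fail, for example at $x = 2^4$, which is why the lemma requires $N_2 \ge 2^8$.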

The last results is obtained by rewriting the main assumption as 3H > 
min {§ , §p} 7B/2 - Again, using 3H < H 5 / 4 for H > N 2 > 2 8 , we get H > 
min j^p} liB ^ 5 ■ H in term is bounded from above by N 2 ^ such that we 
have min {§ , ^} < (jvf^/d^) < N W™) . " □

Best-case to row major layout with multiple input pairs

To begin with, we consider a task that is related to the matrix vector product
but seems somewhat simpler. In [3], a copy task is described, where a matrix
in column major layout is created from a vector such that each non-zero
element in row i is a copy of the ith vector element. The lower bound for this
task is quite similar to the bound for permuting in [1]. In [3], the copy task
is only used as a preliminary analysis to which the matrix vector product is
reduced. Here, we actually can use the task itself to state a lower bound.
Observe that the copy task is equivalent to the creation of a sparse matrix
in row major layout where each non-zero entry in column j is a copy of the
jth vector element. This corresponds to the special case of MapReduce where a
map function simply copies its input value $H/N_M$ times with random
intermediate indices which then have to be ordered by intermediate index.
Thus, a lower bound for this task states a lower bound for the shuffle step
with parallel map functions but non-parallel reduce. We extend this task
further to v input vectors, such that in the created matrix each non-zero
entry in column j is a copy of the jth vector element of one of the v vectors.
In the following, we bound the number of I/Os required for a family of
programs for the copy task such that every matrix conformation (position of
the non-zero elements) can be created by a program.

Normalisations In the following, we make some normalisations to pro- 
grams which will not increase the number of I/Os. Therefore, it suffices to 
consider normalised programs only. Computational operations are not re- 
quired for the copy task and can hence be removed from the program with 
all their results. This can only reduce the number of I/Os. We further can 
remove any other element created during the program that does not run into 
the final matrix. Hence, input vector elements and any copies that do not 
run into the final result are removed throughout the whole program. In this, 
we also remove elements that are no longer required in internal memory
immediately after an I/O. Moreover, we eliminate copy operations within internal
memory. Multiple copies of an input vector element in internal memory are 
useless, and copies can be generated by an output where elements reside 
in internal memory after the output. This step can again only reduce the 
number of elements in internal memory, and thus, the number of I/Os. 

Abstraction We abstract like in [3] from the actual configuration. Instead
of the full information of an element, we consider only the set of (column
index, origin vector index) tuples of the elements in a block or in internal
memory. Hence, blocks and internal memory are considered subsets of
$\{(1, 1), \dots, (N_M, v)\}$ of size at most $B$ and $M$, respectively. This
abstracts especially from the ordering of elements and the multiplicity of
elements with the same column index.

Description of programs We consider the change of configurations over
time. The initial abstract configuration is unique over all programs for fixed
$N_M$. The final configuration in contrast depends on the conformation of the
matrix that is generated. This number will be examined later on.

For a given abstract configuration, we examine the number of possibly
succeeding (abstract) configurations generated by different normalised
programs. To this end, first assume that processor $i$ performs an input while
all the other processors stay idle. After $\ell$ parallel I/Os, there can be at
most $H/B + P\ell \le 2P\ell$ non-empty blocks in external memory. Hence, there
are at most $2P\ell$ blocks that can be read by the input. Afterwards, up to $B$
new elements are added to internal memory of processor $i$. We assume that
unneeded elements are removed right away, which allows a choice of $2^B$ such
that there are at most $2^B \cdot 2P\ell$ succeeding configurations. Now consider
the case that processor $i$ performs an output. For $\ell$ being the maximal
number of I/Os, there are less than $4P\ell$ positions available relative to the
non-empty blocks to perform the output to: $2P(\ell - 1)$ non-empty blocks that
can be overwritten and another $2P(\ell - 1) + 1$ empty positions relative to the
non-empty blocks. The content of the output block can consist of up to $B$ out
of the $X_{i,l}$ elements in internal memory of processor $i$. Further unneeded
elements can be removed after the output which constitutes another $2^B$
different possible configurations.

Each of the P processors can perform either an input, an output or be
idle during the $l$th parallel I/O. Thus, there are up to
$3^P \prod_{i=1}^{P} \binom{X_{i,l}+B}{B} \cdot 2^B \cdot 4P\ell$ possible
configurations succeeding a given configuration. A family of normalised
programs with $\ell$ parallel I/Os can lead to

$$\prod_{l=1}^{\ell} 3^P \prod_{i=1}^{P} \binom{X_{i,l}+B}{B} 2^B \, 4P\ell \qquad (1)$$

distinct abstract configurations.
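To get a feeling for the counting argument, the bound on the number of abstract configurations can be evaluated numerically. The sketch below reads the bound as $\prod_{l=1}^{\ell} 3^P \prod_{i=1}^{P} \binom{X_{i,l}+B}{B} 2^B\, 4P\ell$ and simplifies it by assuming that every processor holds the same number $X$ of elements at every step; both the simplification and the function names are assumptions made for illustration.

```python
import math


def log2_configurations(num_ios: int, P: int, B: int, X: int) -> float:
    """log2 of prod_{l=1..num_ios} 3^P * prod_{i=1..P} C(X+B, B) * 2^B * 4*P*num_ios,
    assuming every processor holds X elements at every step."""
    per_processor = (math.log2(math.comb(X + B, B)) + B
                     + math.log2(4 * P * num_ios))
    per_step = P * (math.log2(3) + per_processor)
    return num_ios * per_step


def min_ios(target_log2: float, P: int, B: int, X: int) -> int:
    """Smallest number of parallel I/Os whose bound reaches target_log2 bits."""
    ios = 1
    while log2_configurations(ios, P, B, X) < target_log2:
        ios += 1
    return ios


bits_one_step = log2_configurations(1, P=1, B=1, X=1)
steps_needed = min_ios(10.0, P=1, B=1, X=1)
```

Here `target_log2` plays the role of the logarithm of the number of abstract configurations that a family of programs has to be able to reach; the lower bound argument amounts to solving this inequality for the number of I/Os.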

Different abstract matrix conformations For a family of programs being
able to produce all conformations, $\ell$ needs to be large enough such that
(1) is at least as large as the number of abstract configurations representing
all conformations. There are $\binom{N_M}{H/N_R}^{N_R}$ different conformations
of $N_R \times N_M$ matrices with $H/N_R$ non-zero entries per row.
Furthermore, each of the non-zero entries can stem from one of the v input
vectors. However, since we consider abstract configurations and ignore the
ordering and multiplicity of elements within a block, the number of final
abstract configurations is less. For an abstract configuration, it is not
clear whether a tuple present in a block stems from one or multiple rows, nor
from which of them. As described in [3], the following cases have to be
distinguished. If $H/N_R = B$, each block corresponds to exactly one row such
that each abstract configuration describes only a single conformation. In case
$H/N_R > B$, a block contains entries from at most two rows. Hence, a column
index in the abstract description of a block can originate either from the
first, the second or both rows. For $H/N_R < B$, a block contains entries from
$BN_R/H$ different rows. Then, the at most $B$ indices in the abstract
description can be column indices of each
of the BNr/H rows. Thus, there are ( h ^ Nr ) blocks with the same abstract 
description. 

Theorem 3. Given block size $B$, internal memory $M \ge 3B$, number of
mappers $N_M \ge 9^{1/\varepsilon}$ and the number of processors $P \le H/B$.
Creating a sparse $N_R \times N_M$ matrix with $H$ non-zero entries in row
major layout from $v \le H/N_M$ vectors such that $a_{ij} = x_j^{(k)}$ for a
$1 \le k \le v$ for each non-zero entry has (parallel) I/O-complexity

. (e 2 H H . (N m Nrv N m v\\ 
I > mm < , log . r m 2H \ mm < , > > . 

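Proof. If a family of programs with $\ell$ I/Os is able to create all
conformations of $N_R \times N_M$ matrices in row major layout with $H$
non-zero entries, then

$$\prod_{l=1}^{\ell} 3^P \prod_{i=1}^{P} \binom{X_{i,l}+B}{B} 2^B \, 4P\ell \ge \binom{N_M}{H/N_R}^{N_R} \frac{v^H}{\tau_R} \qquad (2)$$

with

$$\tau_R \le \begin{cases} 3^H & \text{if } B < H/N_R \\ 1 & \text{if } B = H/N_R \\ (eBN_R/H)^H & \text{if } B > H/N_R \end{cases}$$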

has to hold. For £ > the claim is already proven, so we can assume in 
the following I < Similar, if Nm < 2 s , then the logarithm in Theorem^ 
is smaller than 7 such that the theorem is proven with a scanning bound of 
-jrg which is required for reading the input. Hence, also assume Nm > 2 8 . 
Taking logarithms and estimating binomial coefficients in ^ yields 

£ P 

£P(B + log 3H) + E B lo S e{Xl ' l B + ^ ~ H log ^JT + Hl °Z v - lo S r «- 

i=i i=i 
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Observe that for all I the term YliLi with Ylf=o < niin {PM, H} is 
maximised when Xij = ■ ■ ■ = Xpi = min {M, H/P}. Hence, we have 

WiR , mm{M,H/P} + l \ N M N R v 
iP I B + log 3H + B log e — J > H log — — log tr . 

After substituting r and isolating £, we obtain 

H logmin{^,^f} 



> 



P log 3tf + 5 log (2e(min {f , ^} + 1)) 
and using the assumptions M > 3-B and H/(PB) > 1, we get 

Jf logmin{M^,^} 



> 



Plog3F+| J Blogmin{f ,fg}' 



where we used that 2e(x + 1) < (2x) 7 / 2 for x > 1. 
Case 1: For log3# < |5 log min {§ , f§}, we have 

„ . H . f NmNrv N m v \ 
t > log . r m 2H i mm < , > . 

Case 2: If log3# > \B log min {f , |§}, we can use 3# < # 5 / 4 for # > 
-^Vm > 2 8 , and with Lemma [2}i, we obtain 

17 lognun{^,^} 

Using Lemma |2}ii with iVi = Nr and N 2 = Nm, and ignoring v, we have 

F l gmin{fi^,e< /8 } 



2 1V A/ 

For iVj/ > 3 2 / e in term, we obtain 



H log min {iV^eA^ 8 } g logiV;/ 2 e 2 ^ 



since 



eN 5 ^ 8 > N$ for N M > l/e^s and 9 1 / 6 > for all e > 0. □

Mixed column layout with multiple output pairs

Following pi], for this task, it suffices to consider the multiplication of a 
matrix with the all-ones vector, i.e., the task of simply building row sums of
the matrix. Similarly, we extend the task such that each elementary product 
is only used for the calculations of one of the w output vectors. Hence, we 
consider w independent tasks of building subsets of row sums. This task can 
be seen as a time-inverse variant of the copy task described in Section [AJ 
Instead of spreading copies of input elements, the scattered matrix elements 
of the same row have to be collected and summed up to w subset sums. 
Therefore, we can use a similar analysis as before, but consider the change 
of configurations backwards in time since there are multiple input forms 
depending on the conformation that can all create the same output. 

Normalisation Again, we describe some normalisations to programs for 
this task that do not increase the required number of I/Os. First, observe that 
each non-zero entry can only appear once in the final result since we do not 
guarantee inverse elements. Hence, copies of matrix entries can be removed 
from the program. Second, we assume that non-zero entries from the same 
row are summed immediately when present in internal memory. This can only 
decrease the number of elements in internal memory at a time, and hence 
the number of I/Os. Finally, an element in internal memory shall be deleted 
immediately after an I/O if it is no longer used. This implies that an element 
that is output will be removed from internal memory immediately after its 
output since we argued before that copying is useless. Furthermore, when 
an input is performed but some of the elements are removed immediately 
after the input, one can think of loading only the elements that are actually 
required. 

Abstraction We consider an abstract configuration similar to the section
before. Here we consider the set of tuples of row indices and destinations of
the elements (or sums of elements). Note that sums can only consist of
elements from the same row and the same destination, otherwise the created
sum cannot run into the final result. Hence, internal memory and each block
represent a subset of $\{(1,1), \dots, (N_R, w)\}$ of size up to $M$ and $B$,
respectively.

Description of programs The proof for this task considers the change of
configurations backwards in time from the final configuration to the initial.
Because concurrent reads are possible in our model, we assume that elements
in a block in external memory that do not belong to the final output
disintegrate magically when read for the last time. This holds especially for
the case that an output is performed to the ith block on disk. Since this will
replace the elements present before, the former elements cannot be read any
more and hence have disintegrated before. Thus, outputs are only performed to
empty blocks.

Consider the final configuration after $\ell$ I/Os. Since all blocks that do
not belong to the output disintegrated, the final (abstract) configuration
is unique for all programs that compute the matrix vector product for fixed
$N_R$. In contrast, the initial configuration depends on the conformation of
the matrix, i.e., the position of the non-zero elements. For a family of
programs that create the matrix vector product for sparse $N_R \times N_M$
matrices, the changes of configurations can be considered as a tree rooted in
the final configuration. Each leaf corresponds to a different matrix
conformation and each layer of depth $i$ corresponds to the configurations at
time $\ell - i$.

We normalised our programs such that computational operations are only
done immediately after an I/O. Hence, it suffices to consider input and output
operations only. Given a certain configuration after $l$ I/Os, we now need to
count the number of possible preceding configurations. Note that since we
normalised our programs to avoid copying matrix entries, at any point in
time there can be no more than $H$ non-empty blocks on disk.

For the $l$th parallel I/O, consider the case that processor $i$ performs an
input. The $l$th configuration can stem from several preceding configurations.
To upper bound the number of preceding configurations, we assume that
some elements of the input block disintegrated immediately after the input.
This changes the configuration on disk, otherwise only the change of internal
memory has to be considered. Again, there are at most $H/B + P\ell \le 2P\ell$
non-empty positions on disk to choose which block was input. Furthermore,
there are at most $B$ row indices input. For $X_{i,l}$ being the number of
distinct row indices in internal memory after the $l$th I/O, there are at most
$\binom{X_{i,l}+B}{B}$ possibilities which row indices belong to the input.
Each of them could however have been present before, or appear as a new index.
This makes another $2^B$ possibilities. Altogether, there are up to
$\binom{X_{i,l}+B}{B} 2^B 2P\ell$ preceding configurations if processor $i$
performs an input, while all other processors are idle.

Now, consider the case that processor $i$ performs an output. This alters
one block on disk that was empty before. Hence, there are up to $4P\ell$
possible preceding configurations. In the preceding configuration, the
elements of the output block are in internal memory. However, when choosing
the block position where the output is performed, these elements are
determined by the (known) configuration after the output.

Each of the P processors can perform either an input, an output or be
idle during the $l$th I/O. Thus, there are up to
$3^P \prod_{i=1}^{P} \binom{X_{i,l}+B}{B} 2^B \, 4P\ell$ possible preceding
configurations. A family of programs with $\ell$ parallel I/Os can hence
create the matrix vector product from at most

$$\prod_{l=1}^{\ell} 3^P \prod_{i=1}^{P} \binom{X_{i,l}+B}{B} 2^B \, 4P\ell \qquad (3)$$

initial abstract configurations. This result is still independent from the
layout of the matrix.

Different abstract matrix conformations To gain a lower bound on
the I/O-complexity of the mixed column layout, we have to lower bound
(3) by the number of different matrix conformations in mixed column layout
expressed in abstract blocks on disk. We consider matrices with exactly
$H/N_M$ elements per column. Think of drawing the non-zero elements for each
column one after another. For the number of non-zero entries per column
$H/N_M \le N_R/2$, there are at least $N_R/2$ possibilities to draw the
position of a non-zero element. Furthermore, each element can be involved in
the computation of one of the $w$ output vectors. In total, there are at least
$(N_R/2)^H w^H$ different matrix conformations. However, abstracting to the
view of (row index, destination) tuples, the ordering of elements within a
block gets hidden. Additionally, if a block contains elements from several
columns, in its abstraction it is not clear from which column(s) a tuple
stems. The number of different conformations that correspond to the
same abstract conformation can be bounded from above by $B^H$ since each
element can be one of the at most $B$ row indices of its block.
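Dividing the number of conformations by the number of conformations per abstract conformation gives at least $(N_R/2)^H w^H / B^H$ distinct abstract conformations, i.e., $H \log \frac{N_R w}{2B}$ bits. The following sketch merely evaluates this quantity with base-2 logarithms (an assumption); the function name is hypothetical.

```python
import math


def log2_abstract_conformations(H: int, N_R: int, w: int, B: int) -> float:
    """log2 of (N_R/2)^H * w^H / B^H, i.e. H * log2(N_R * w / (2 * B))."""
    return H * math.log2(N_R * w / (2 * B))


bits = log2_abstract_conformations(H=4, N_R=16, w=2, B=4)
```

A family of programs can only produce all conformations if the logarithm of its configuration bound reaches this value, which is the starting point of the proof of the theorem below.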

Theorem 4. Given block size $B$, internal memory $M \ge 3B$, number of
reducers $N_R \ge 1/\varepsilon^{8/3}$ and the number of processors $P \le H/B$.
Creating the combined matrix vector product for a sparse $N_R \times N_M$
matrix in mixed column layout with $H$ non-zero entries for $w \le H/N_R$
output vectors, with $H/N_R \le N_M^{1-\varepsilon}$ and
$H/N_M \le N_R^{1-\varepsilon}$ for $\varepsilon > 0$ has (parallel)
I/O-complexity



Proof. A lower bound on the minimal number of I/Os $\ell$ required for a family
of programs that create the matrix vector product for $N_R \times N_M$ matrices
with $H$ entries in mixed column layout is given by

$$\prod_{l=1}^{\ell} 3^P \prod_{i=1}^{P} \binom{X_{i,l}+B}{B} 2^B \, 4P\ell \ge \left(\frac{N_R}{2}\right)^H \frac{w^H}{B^H} . \qquad (4)$$

With similar arguments as in the proof of Theorem 3, we can assume
$\ell \le \frac{1}{4} H/P$ and $N_R \ge 2^8$. Otherwise the theorem holds
trivially. Taking logarithms and estimating binomial coefficients yields

i p 

£P(B + log 3H) + J2J2 B lo § eB >H\og^ + Hlogw 



=1 i=l 



As described in Section [XJ the left-hand side is maximised for each / when 
Xi t i = ■ ■ ■ = Xp t i = min {M, H/P}. Thus, we obtain 

nr.fr, , ^, e min {M, H/P} + eB\ TT1 N R w 
IP [B + \og3H + B log i — ' J s > H log 



B J ~ ° 2B 

and isolating £, we can estimate 



> 



H log^ 



Plog3F+|51ogmin{f,||}- 



where we used again M > 3B and H / (PB) > 1. 

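Case 1: For $\log 3H \le \frac{7}{2} B \log \min\left\{ \frac{M}{B}, \frac{2H}{PB} \right\}$, the lower bound matches the 
sorting algorithm: 

$$\ell \ge \frac{H}{7PB} \log_{\min\left\{ \frac{M}{B}, \frac{2H}{PB} \right\}} \frac{N_R w}{2B} .$$

Case 2: For $\log 3H > \frac{7}{2} B \log \min\left\{ \frac{M}{B}, \frac{2H}{PB} \right\}$, we get 

$$\ell \ge \frac{H \log \frac{N_R w}{2B}}{2P \log 3H} .$$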

Using Lemma |2}i, and 3if > if 5 / 4 for .P > 2 8 , we have 

and with Lemma [2j.il, we get 

£> H\ogeKj(\v > e_E_ 
~ P \\ogNf ~ 10 P 

for iV/j > 1/ e 8 / 3 which is matched by the direct algorithm. □ 
Column major layout with multiple output pairs 

For column major layout, the number of different abstract matrix conformations 
corresponds to the number of abstract conformations described in 
Section [A], but with $N_M$ and $N_R$ exchanged. 

If a family of programs with $\ell$ I/Os is able to create the matrix vector 
product for each $N_R \times N_M$ matrix with $H$ non-zero entries to obtain 
$w \le H/N_R$ vectors of row sum subsets, then 

n3-n(^; s )2^>(^j N v A „ (s) 

with 

$$\tau_M \le \begin{cases} 3^H & \text{if } B < H/N_M \\ 1 & \text{if } B = H/N_M \\ (eBN_M/H)^H & \text{if } B > H/N_M \end{cases}$$

has to hold. This yields the following theorem. 

Theorem 5. Given block size B, internal memory M > 3B, number of 
mappers Nm > 9 1//e and the number of processors P < ^. Creating the 
combined matrix vector product for a sparse NrxNm matrix in column major 
layout with H non-zero entries for w output vectors, and H/Nr < N\j e and 
H/Nm < N R ~ e for e > has (parallel) I/O- complexity 

N m Nrw N r w\ \ 
3H ' eB J J ' 

Best-case layout with multiple input and output pairs 

For the best-case layout, the algorithm is allowed to choose the layout of the 
matrix. This makes the task of building row sums trivial, since the algorithm 
can set the layout of the matrix to row major layout. Hence, we have to follow 
the movement and copying of input vector elements as well. 

For this task, we consider both multiple input and multiple output vectors. 
Recall that any intermediate pair stems from exactly one input element, 
and it is required for the computation of only a single output element. For 
each of the $H$ non-zero entries in our matrix there is not only a choice of its 
position but also the choice from which of the $v$ input elements it stems and 
which of the $w$ output elements is its destination. This results in a total 
number of $\binom{N_M N_R}{H} v^H w^H$ different tasks. We assume that 
$v \le H/N_M$ and $w \le H/N_R$ such that all input and output elements can be 
useful. 
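
The number of tasks grows quickly, so it is convenient to work with its logarithm. The following sketch (helper names are illustrative assumptions) computes $\log_2$ of $\binom{N_M N_R}{H} v^H w^H$ via the log-gamma function and compares it with the standard estimate $\binom{n}{k} \ge (n/k)^k$.

```python
import math

def log2_binom(n, k):
    """log2 of the binomial coefficient C(n, k), computed via lgamma."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1)
            - math.lgamma(n - k + 1)) / math.log(2)

def log2_tasks(N_M, N_R, H, v, w):
    """log2 of C(N_M*N_R, H) * v^H * w^H, the number of different tasks."""
    return log2_binom(N_M * N_R, H) + H * (math.log2(v) + math.log2(w))

def log2_tasks_estimate(N_M, N_R, H, v, w):
    """Lower estimate H*log2(N_M*N_R*v*w/H), using C(n, k) >= (n/k)^k."""
    return H * math.log2(N_M * N_R * v * w / H)

# Tiny hypothetical instance that can be checked by hand: C(12, 2) * 2^2 * 2^2.
print(2 ** log2_tasks(4, 3, 2, 2, 2))
```

The estimate is what enters the later inequalities, where only the term $H \log \frac{N_M N_R v w}{H}$ is kept.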



e 2 H H , f 

> mm < , log . r m 2H \ mm < 

15? 7PB &mm lir>ps} \ 
The task of creating this type of matrix vector product can be seen as 
spreading input elements to create elementary products $a_{ij}x_j$, and then 
collecting these elementary products to form the output vectors. To describe 
these two main tasks, we distinguish between matrix entries $a_{ij}$, elementary 
products $a_{ij}x_j$ and partial sums $\sum_{j \in S} a_{ij}x_j$, which we trace as 
described in Section [AJ, and input vector elements, which will be followed as 
described below. We also refer to the first group (matrix entries, elementary 
products, partial sums) as row elements, and we call the row index their index. 

Normalisation At first, we again make use of the normalisations described 
already. In doing so, we now distinguish between input vector elements and 
row elements. For row elements, we apply the normalisations described in 
Section [A] (multiple copies are removed, row elements are summed immediately 
and unused elements are removed from internal memory after an I/O). 
For input vector elements, we use the normalisations of Section [AJ. 
Furthermore, we create an elementary product $a_{ij}x_j$ as soon as both elements 
reside in memory for the first time, i.e. immediately after an input. The 
elementary product is stored at the internal memory position where $a_{ij}$ was 
present. This is possible since $a_{ij}$ is not required for any operation other 
than the multiplication with $x_j$, and can hence be replaced. This 
normalisation does not increase the memory usage at any time. 

Abstraction As in the previous sections and analogously to [4], we consider an 
abstraction of the current configuration. In this, we define the set of tuples 
Mfj Q {(1, 1), • • • , (Nm, v)}, \M\j\ < M of input vector elements that are in 
internal memory of processor i after the jth parallel I/O. Similar, let -Mf^ C 
{(1,1),..., (Nr, w)}, \M.f.j | < M be the set of tuples of row elements residing 
in internal memory of processor % after the jth parallel I/O. Additionally, for 
each block on disk, we consider the set of tuples of input vector elements 
contained and call this over all blocks on disk together with ■ ■ ■ ^ e 
abstract input configuration after the jth parallel I/O. Analogously, we define 
the abstract row configuration after the jth parallel I/O to be -Mi j, . . . , Mpj 
together with the sets of tuples of row elements of each block on disk. 

Description of programs The final abstract row configuration is again 
unique over all programs that create the matrix vector product for fixed $N_R$. 
Hence, we can apply our analysis from Section [A] to describe the number of 
different abstract row configurations that can be present during the execution 
of a program with $\ell$ I/Os after the $l$th I/O. To describe the abstract input 
configurations, we can use the analysis from Section [A}. Together, the number 
of distinct (input vector / row) configurations that can result from a given 
(initial / final) configuration after $\ell$ parallel I/Os is bounded by (3). 

So far, we described all operations concerning only input vector elements, 
or only row elements, but no interaction between these elements. There 
is only one operation that performs an interaction, and that is multiplying 
an input element with a non-zero entry, which creates a row element. For 
each non-zero entry $a_{ij}$, exactly one multiplication with an input vector 
element is performed during the whole execution of the program. Hence, if in 
our abstraction a row element with index $i$ was created by multiplication from 
an input variable with index $j$, this fixes $a_{ij}$ to be non-zero. We 
normalised programs such that an elementary product is created immediately when 
both elements appear in internal memory together for the first time. This can 
only happen after an input. Thus, for each input there are at most $B$ new 
elements in internal memory and some $X_{i,j}$ elements that are already in 
internal memory. Hence, there are at most $BX_{i,j}$ possibilities where a 
multiplication can be done for each processor after each I/O. In total, we get 
$Y = \sum_{j=1}^{\ell} \sum_{i=1}^{P} BX_{i,j} \le \ell B \cdot \min\{PM, H\}$ 
possibilities where a multiplication can be performed. Together with the traces 
that describe the movement and copying / summing of input and row elements, 
this gives a unique description of the abstract matrix conformation. 
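
The bound on the number $Y$ of multiplication possibilities can be illustrated with a short sketch. The data layout (a list of rounds, each holding one memory occupancy per processor) and the function names are assumptions made for this example only.

```python
def multiplication_slots(X, B):
    """Sum B * X[j][i] over all parallel I/Os j and processors i.

    X[j][i] is the number of elements already residing in the internal
    memory of processor i when its j-th input of at most B elements arrives.
    """
    return sum(B * x for round_ in X for x in round_)

def slots_upper_bound(ell, B, P, M, H):
    """The closed-form bound ell * B * min{P*M, H}."""
    return ell * B * min(P * M, H)

# Hypothetical run: ell = 3 parallel I/Os, P = 2 processors, M = 8, H = 10, B = 2.
# Per round, the occupancies of all processors sum to at most min{P*M, H} = 10.
X = [[5, 5], [3, 4], [0, 5]]
Y = multiplication_slots(X, B=2)
print(Y, slots_upper_bound(ell=3, B=2, P=2, M=8, H=10))
```

The closed form is deliberately coarse: it charges every round with the maximal total occupancy $\min\{PM, H\}$, which is all the counting argument needs.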

Theorem 6. Given block size $B$, internal memory $M \ge 3B$, and the number 
of processors $P \le \frac{H}{B}$. Creating the combined matrix vector product 
for a sparse $N_R \times N_M$ matrix in best-case layout with $H$ non-zero 
entries for $v \le H/N_M$ input and $w \le H/N_R$ output vectors, with 
$H/N_R \le N_M^{1/6}$ and $H/N_M \le N_R^{1/6}$ has (parallel) I/O-complexity 



Proof. With the above observations, for a family of programs with $\ell$ I/Os 
that create the matrix vector product for any $N_R \times N_M$ matrix with $H$ 
non-zero entries, where the layout of the matrix can be chosen by the program, 
the following inequality has to hold 



Like in the proofs before, we argue that i < jp and Nr > 2 8 , otherwise 
the claim holds trivially. The left-hand side is again maximised for 
$X_{i,j} = \min\{M, H/P\}$ for all $i$ and $j$. Estimating binomials and taking logarithms, we 



£ > min 
have 

$$\ell P \left( \log 3H + \frac{7}{2} B \log \min\left\{ \frac{M}{B}, \frac{2H}{PB} \right\} \right) \ge H \log \frac{N_M N_R v w}{H} - H \log \frac{e \ell B \min\{PM, H\}}{H}$$

where we followed the calculations given for the other layouts. Reordering 
terms, we obtain 

$$\ell \ge \frac{H \log \frac{N_M N_R v w}{e \ell B \min\{PM, H\}}}{P \left( \log 3H + \frac{7}{2} B \log d \right)}$$

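with $d = \min\{M/B, 2H/(PB)\}$. By Lemma A.2 in [4], $x \ge \frac{\log_b(s/x)}{t}$ implies 
$x \ge \frac{\log_b(s \cdot t)}{2t}$. For $x = \ell$, $s = \frac{N_M N_R v w}{eB \min\{PM, H\}}$ and 
$t = \frac{P}{H} \left( \log 3H + \frac{7}{2} B \log d \right)$ this results in 

$$\ell \ge \frac{H \log \frac{N_M N_R v w P \left( \log 3H + \frac{7}{2} B \log d \right)}{eBH \min\{PM, H\}}}{2P \left( \log 3H + \frac{7}{2} B \log d \right)} .$$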
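
The implication used in this step can be sanity-checked numerically. The sketch below assumes that the lemma has the form "$x \ge \log_b(s/x)/t$ implies $x \ge \log_b(s \cdot t)/(2t)$" for positive $x$, $s$, $t$ and takes $b = 2$; the sampling ranges and function names are arbitrary choices for illustration, and a random search is of course no substitute for the proof in [4].

```python
import math
import random

def hypothesis(x, s, t, b=2.0):
    """Premise of the implication: x >= log_b(s / x) / t."""
    return x >= math.log(s / x, b) / t

def conclusion(x, s, t, b=2.0):
    """Conclusion of the implication: x >= log_b(s * t) / (2 * t)."""
    return x >= math.log(s * t, b) / (2 * t)

def find_counterexample(trials=10000, seed=0):
    """Sample positive (x, s, t) and return a violating triple, if any."""
    rng = random.Random(seed)
    for _ in range(trials):
        x = 10 ** rng.uniform(-3, 6)
        s = 10 ** rng.uniform(-3, 9)
        t = 10 ** rng.uniform(-3, 3)
        if hypothesis(x, s, t) and not conclusion(x, s, t):
            return (x, s, t)
    return None

print(find_counterexample())
```

The point of the lemma is that it removes $\ell$ from the right-hand side: the implicit inequality in $\ell$ becomes an explicit lower bound at the cost of a factor 2 in the denominator.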
Case 1: For $\log 3H \le \frac{7}{2} B \log d$, we get a lower bound of 

tt loff N mNrvw 

~ 14P Plogd ' 
Case 2: For $\log 3H > \frac{7}{2} B \log d$, we get 

J N M N R vw lo g 3H N N 

p > H " S eBHmin{M,H/P} > B lOg 

~ AP log 3/7 - 4P §log# 
for if > 2 8 . Setting H/Nm < A^J and using Lemma [^Ji with e = |, we get 



H log 



iV 



i/6 



i? 



S^logiV^ 5 

for $N_R \ge 9$. Using Lemma 2.iii with $N_1 = N_M$ and $N_2 = N_R$, we get 



\nI /6 



> H lQ g <X /(7a) g log K 3 > 



6P logiV R 6P log iV^ - 20P 

for B > 4, using iV H > 2 8 > (|) 30 such that fiV 1 / 3 > jy 3 / 10 . □ 
Transposing bound 

Another method to obtain a lower bound for the I/O-complexity is presented 
in [1]. There, a potential function is used to lower bound the complexity of 
dense matrix transposition. This bound can also be applied to sparse matrix 
transposition if the matrix is given in column major layout. 

In the following, we extend the potential to the PEM model. For each 
block $j$ written on disk at time $t$, its togetherness rating is defined by 

$$\beta_j(t) = \sum_{i=1}^{H/B} f(x_{ij}(t))$$

where $x_{ij}(t)$ is the number of elements present in block $j$ at time $t$ that 
belong to the $i$th output block, and 

$$f(x) = \begin{cases} x \log x & \text{for } x > 0 \\ 0 & \text{otherwise.} \end{cases}$$

Similarly, to internal memory of processor $k$, a togetherness rating of 

$$\mu_k(t) = \sum_{i=1}^{H/B} f(y_{ik}(t))$$

is assigned with $y_{ik}(t)$ being the number of elements that belong to output 
block $i$ and reside at time $t$ in internal memory of processor $k$. The potential 
is then defined by 

$$\Phi(t) = \sum_{k=1}^{P} \mu_k(t) + \sum_{j=1}^{\infty} \beta_j(t) .$$
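
The potential can be computed directly from a snapshot of the memories and the disk. The following sketch is only an illustration: the representation of memories and blocks as lists of element identifiers and the mapping `target_block` from elements to output blocks are assumptions of this example, and logarithms are taken to base 2.

```python
import math
from collections import Counter

def f(x):
    """f(x) = x log x for x > 0, and 0 otherwise (logarithm base 2)."""
    return x * math.log2(x) if x > 0 else 0.0

def togetherness(elements, target_block):
    """Togetherness rating of one disk block or one internal memory."""
    counts = Counter(target_block[e] for e in elements)
    return sum(f(c) for c in counts.values())

def potential(memories, disk_blocks, target_block):
    """Sum of the ratings of all internal memories and all disk blocks."""
    return (sum(togetherness(m, target_block) for m in memories)
            + sum(togetherness(b, target_block) for b in disk_blocks))

# Hypothetical instance with H = 4 elements and B = 2:
# elements 0 and 1 belong to output block 0, elements 2 and 3 to output block 1.
target_block = {0: 0, 1: 0, 2: 1, 3: 1}
start = potential([[], []], [[0, 2], [1, 3]], target_block)
end = potential([[], []], [[0, 1], [2, 3]], target_block)
print(start, end)
```

In this toy instance every input block meets every output block in exactly one element, so the potential starts at 0, and it ends at $H \log B = 4$ once the elements are grouped into their output blocks.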

In [1], it is shown that the change of the potential induced by an I/O is 
$\Delta\Phi(t) \le 2B \log \frac{M}{B}$. To extend the bound to the PEM model, we have to 
restate this argument for parallel I/Os. 

Lemma 3. The increase of the potential during one parallel I/O is bounded 
by 

$$\Delta\Phi(t+1) \le PB \log 2e + PB \log \frac{\min\{M, H/P\}}{B} .$$

Proof. Obviously, an output can never increase the potential. An input of 
block $j$ by processor $k$ in contrast induces the following change in the 
potential 

$$\Delta\Phi(t+1) = \sum_{i=1}^{H/B} f(y_{ik}(t) + x_{ij}(t)) - f(y_{ik}(t)) - f(x_{ij}(t)) .$$
Hence the maximal increase of the potential function is bounded by 

$$\Delta\Phi(t+1) \le \sum_{k=1}^{P} \sum_{i=1}^{H/B} f(y_{ik}(t) + x_{ij_k}(t)) - f(y_{ik}(t)) - f(x_{ij_k}(t))$$

where $j_k$ is the index of the block input by processor $k$. 

Let $I_{kj_k}(t)$ be the set of indices $i$ such that $x_{ij_k}(t) \ge 1$ and $y_{ik}(t) \ge 1$. 
Substituting $f(x)$, we obtain 

k=nei kjk (t) Vik{) Xijk{) 

< v V ^ (t) i r mw^)i!^! 

< P51oge + ^ £ x .,( t) lo g |^ 

< P5iog 2e+ E E ^w^lnl- 



Now let Efe=iEi e / fc ^(t)^ fc (0 = X W and £ fc=1 £ie/ Wfc (t) S/« (*) = F W 
for 1 < fc < P. In order to upper bound A$(£ + 1), we substitute Xij k (t) = 
x ijk (t)/X(t) and y ik (t) = y ik {t)/Y{t) such that we get 



A$(t + 1) < PPlog2e + E E 

fc=l «G/ fejfc (t) 



log "7 TTT + log 

. x ijk {t) X(t) 



p 

= PPlog2e + X(t)log^+X(t)^ £^(f)logM|. 

The last sum describes a negative Kullback-Leibler divergence. Note that 
for fixed $t$, the normalised values $x_{ij_k}(t)/X(t)$ and $y_{ik}(t)/Y(t)$ each 
constitute a probability distribution over the pairs $(k, i)$ with 
$k \in \{1, \dots, P\}$ and $i \in I_{kj_k}(t)$. It is well known that the 
Kullback-Leibler divergence is zero when both probability distributions are 
equal, and positive otherwise. Hence, since it appears in a negative form, the 
last term is upper bounded by 0. 
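
The effect of dropping the Kullback-Leibler term can be checked on small examples. The sketch below (function names and sample values are illustrative assumptions) compares the sum $\sum_i x_i \log \frac{y_i}{x_i}$ with $X \log \frac{Y}{X}$, where $X$ and $Y$ are the totals; the difference between the two is exactly $X$ times the divergence of the normalised values.

```python
import math

def cross_term(xs, ys):
    """Sum of x_i * log2(y_i / x_i) over positive pairs (x_i, y_i)."""
    return sum(x * math.log2(y / x) for x, y in zip(xs, ys))

def kl_divergence(p, q):
    """Kullback-Leibler divergence (base 2) of distribution p from q."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def bound_without_kl(xs, ys):
    """X * log2(Y / X), obtained by bounding the negative KL term by 0."""
    X, Y = sum(xs), sum(ys)
    return X * math.log2(Y / X)

xs, ys = [1, 2, 3], [4, 1, 2]
X, Y = sum(xs), sum(ys)
p = [x / X for x in xs]
q = [y / Y for y in ys]
# cross_term equals the bound minus X times the divergence.
print(cross_term(xs, ys), bound_without_kl(xs, ys) - X * kl_divergence(p, q))
```

Equality holds exactly when the two sequences are proportional, which is the case of identical normalised distributions.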

Considering the second term, we obviously have $X(t) \le PB$ and 
$Y(t) \le \min\{PM, H\}$. Thus, $X(t) \log \frac{Y(t)}{X(t)} \le X(t) \log \frac{\min\{PM, H\}}{X(t)}$, which is in turn 
upper bounded by $PB \log \min\left\{ \frac{M}{B}, \frac{H}{PB} \right\}$ since $\min\{PM, H\} \ge P$. □ 
Since we can exclude copy operations for this task, the final potential 
is obviously $\Phi(\ell) = H \log B$. The initial potential in contrast has to be 
considered for each matrix separately. W.l.o.g., in the following let $N_M \ge N_R$. 
For $H/N_M \ge B$, any matrix that is row-wise $H/N_M$-regular and column-wise 
$H/N_R$-regular has an initial potential of $\Phi(0) = 0$: Observe that in 
this case each input block intersects with an output block in at most one 
element. Hence, the potential yields a lower bound of $\Omega\left( \frac{H}{PB} \log_d B \right)$ with 
$d = \min\left\{ \frac{M}{B}, \frac{H}{PB} \right\}$. 
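
The way the potential turns into a bound on the number of parallel I/Os can be sketched as follows: the potential has to rise from $\Phi(0)$ to $H \log B$, and each parallel I/O raises it by at most the amount given in Lemma 3. The function name and the sample parameters are assumptions of this illustration, and logarithms are taken to base 2.

```python
import math

def potential_lower_bound(H, B, M, P, initial_potential=0.0):
    """(H log B - Phi(0)) / (PB log 2e + PB log(min{M, H/P} / B))."""
    final_potential = H * math.log2(B)
    max_increase = (P * B * math.log2(2 * math.e)
                    + P * B * math.log2(min(M, H / P) / B))
    return (final_potential - initial_potential) / max_increase

# Hypothetical instance: H = 2^20, B = 64, M = 2^16, P = 16, Phi(0) = 0.
ell = potential_lower_bound(H=1 << 20, B=64, M=1 << 16, P=16)
print(ell)
```

A larger initial potential only weakens the bound, which is why the matrices considered here are chosen such that $\Phi(0)$ is as small as possible.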

For $H/N_M < B$, we consider the following matrix. All non-zero entries $a_{ij}$ 
are located at coordinates where $(i-1)H/N_R + 1 \le j \le iH/N_R \bmod N_M$ 
holds. Observe that any matrix with such a conformation stored in column 
or row major layout corresponds directly to a dense $H/N_M \times N_M$ matrix in 
column major layout, row major respectively. Thus, in accordance with [1], we 
get an initial potential of $\Phi(0) = H \log \max\left\{ 1, \frac{B}{H/N_M}, \frac{B}{N_M}, \frac{B^2}{H} \right\}$. Hence, we 
get for this dense matrix a lower bound of 

$$\Omega\left( \frac{H}{PB} \log_d \min\left\{ B, \frac{H}{N_M}, N_M, \frac{H}{B} \right\} \right). \qquad (6)$$
Combining the bounds 

Consider a matrix given in column major layout. For the scenario 
$N_M \ge B \ge H/N_M$, the minimum breaks down to the term $H/N_M$. Combining the 
results from Theorem 5 with the above bound we get 
$\Omega\left( \frac{H}{PB} \left( \log_d \frac{H}{N_M} + \log_d \frac{N_M N_R B}{H} \right) \right)$, which is bounded 
from below by $\Omega\left( \frac{H}{PB} \log_d N_R \right)$. Similar observations hold for 
$N_R \ge B \ge H/N_R$. Considering the other cases for the minimum in (6), we 
get a lower bound of 

$$\Omega\left( \min\left\{ \frac{H}{P}, \frac{H}{PB} \log_d \min\left\{ \frac{N_M N_R B}{H}, N_M, N_R, \frac{H}{B} \right\} \right\} \right)$$

I/Os for matrices in column major layout. 

Given a matrix in mixed column layout, we can apply (6) as well since 
column major is a special case of the mixed column layout. Hence, with 
similar considerations, we obtain a lower bound of 

$$\Omega\left( \min\left\{ \frac{H}{P}, \frac{H}{PB} \log_d \min\left\{ N_R, \frac{H}{B} \right\} \right\} \right)$$

I/Os for matrices in mixed column layout. 